Introduction

Text mining and analytics were conducted on a dataset of scraped Reliefweb.int (RW) articles on Ukraine. Articles were limited to the year 2022 and the English language. A total of 3,895 documents were scraped, tokenised and all stop words (the, a, we, can) were removed.

As can be seen from the scatterplot above, the types of actor – UN, NGOs, donors – are indicated by the colour of each circle. On the x-axis is the number of documents produced that year by each agency/contributor to reliefweb in 2022 (on a log scale). The y-axis shows th number of sectors/themes each agency contributed to. The number of documents produced is also shown by the size of each circle.


sources scatterplot Full-sized plot


Let us take a look at the most common word pairs inside the text of the scraped articles. Only more common word pairs have been included and the thickness of the line between them indicates the number of times this pair appears in the corpus. The results are not unanticipated.

If I knew nothing about the situation in Ukraine, from this graph, I can glean that there is a war and a humanitarian response to it. I can see sectors being mentioned (food, water, health, protection, education – is shelter apart because the government is handling it?). I can also see that there is displacement and refugees. The word million shows up, as does scale.


word pair network graph Full-sized graph


Let us, finally, take a macro-view of the dataset and plot, in a bit more detail, the correlations between word pairs – bigrams – within the corpus. So that we may get a lay of the land, so to speak.

This network graph is not only much more complex, but it is also formed of word pairs – bigrams – as this tends to improve interpretability at the cost of sensitivity. However, now the main patterns in the RW corpus are visible. This is the lay of the land, so to speak.


network graph kk full Download full-sized graph


Any bigram that appears here has at minimum occurred 50 times in the corpus and has at least a 0.15 correlation with at least one other bigram (that they can be found in the same document 15% of the time).

Immediately, we note that large proportions of RW are tied to conflict reporting (coming largely from OSCE and ACLED) as well as numerous releases from the IAEA. Both are imporant, but it would be far easier to analyse them separately, and, in the case of ACLED, it is much better to analyse the content of each conflict event’s description, as opposed to using the general weekly ACLED report that RW has been uploading – the issue there being words like Kazakhstan or Tibet come up because ACLED weekly reports cover all countries.

We can now see the full spectrum of rhetoric on Ukraine – respect international, humanitarian law, recognised borders, war crimes, and of course, human rights are front and centre. I wish these Europeans would be as concerned about human rights in other parts of the world as well.

But we also start to see certain keywords of high importance to persons in the industry, frequently appearing in the news: black sea grain initiative, humanitarian corridors and gender based violence.




Progression of the conflict

So, now that we can see grand strokes, of the corpus, how do we extract a bit more value from this dataset? One easy win is to look at this emergency through the lens of time – it is a hot conflict, after all.

Let’s start by looking at a simple GIF of conflict events across the whole of 2022, based on the ACLED dataset:


event type gif Full-sized GIF


So we can see that there are clear patterns to the violence. And whilst milbloggers definitely have much better analyses out there, but let us see if the RW corpus is responsive to the changing situation. Below, we have separated out bigrams by month, with the log-odds of a bigram appearing in a given month (over all other months) on the x-axis.


bigram months Full-sized plot


Based on RW rhetoric, the humanitarian response looks very responsive. Moving swiftly from reporting casualties to reporting needs and updating those calls when situation changes: from immunization programmes and hiv programmes to communicable disease, de-mining and nuclear war, finishing the year with winter assistance and widespread blackouts. It would be good to see how the response looked on the ground.




Sources

As demonstrated by the tightly-knit clusters of bigrams surrounding IAEA, ACLED and OSCE submissions, the source matters a lot. Furthermore, we would also dearly believe that each of the agencies we work for has its own specialities and particularities that make it our preferred actor.



Tf, tf-idf and log-odds

We will explore what each source has written by looking at:

  • Term frequency
  • Term frequency-indirect document frequency
  • Log-odds

Term frequency here shows us the phrases that each organisation uses most commonly. Often, their name or their mandate (“food assistance”) will appear in this list.

Tf-idf (term frequency-inverse document frequency) is another measure by which significant words are identified. The term frequency – the number of times a term appears in the corpus – is tempered by the inverse document frequency, which discounts words that tend to appear again and again (think “humanitarian assistance” or “including women”). The combined metric is useful for determining which words are common, but not too common. A suspicious-minded person might even say that this is where we may find what agencies really care about, as opposed to the boilerplate that they flood every report with.

Finally, we also evaluate words within a corpus by looking at their log-odds. What this means, for this section, is that the words appearing in under the log-odds are more likely to originate from that source over any other sources. This is where we see what unique information each source brings to the table.



Bigram plots by source

A series of plots has been generated for each source with more than 5 contributions to RW in 2022 about Ukraine. Each plot shows the top bigrams by term frequency, tf-idf and log-odds.

The organisations below are just a small selection of all bigram plots by source. All the plots may be viewed here.

If you see a bigram you’re curious about, just make use of the Bigram Search Helper. Just type the bigram or source or date into the relevant searchbox.



ACAPS

Let’s start with ACAPS, because it comes first alphabetically and most people would have read at least some of their material.

Overall, it seems that ACAPS is concerned foremost with access constraints, with the top 3 of its tf-idf list being rounded out by active ground conflict. ACAPS also shows which sources it prizes highly – REACHand UNHCR, likely influencing the placement of building materials at the top of the bigrams with the highest log-odds. Strangely, this is not reflected in UNHCR’s own list of bigrams.



Plan International


Plan International is a good example of a responsible investment. Their log-odds mentions a [multi-]sectoral response, as well as the statement [regardless of?] religion country age ability sexuality perceived differences, indicating a strong (or at least more vocal) commitment to human rights, though it could just be boilerplate (we’d have to check further down the Term frequency column). Other language such as provide holistic and initial service indicates fair programmatic thinking. All this makes Plan International a strong candidate if I were looking for a full-service NGO.



Malteser International

Much less informative are the documents originating from Malteser International. It just seems like a lot of press releases mentioning senior staff. Whilst there would definitely be much more detail in a report, I would also not really prioritise reading any of their products. Christian Aid is another such case.


European Commission


Changing lanes a bit, the EC’s rhetoric seems to be very consistent. The acceptance of hryvnia banknotes and solidarity lanes are clearly an important pieces of the response that the EC would like to highlight. And yes, these actions are laudable, but I also feel very irritated to have to ask why we can do this for white Europeans but not the Rohingya. We could have arranged for their gold to be sold at fair prices.

Otherwise, the EC cares about VET schools, vocational education and eu4skills protection. This also indicates a level of investment in human capital (after all, they might one day be EU citizens) that is absent in other parts of the world. Not to shame Europe, but just goes to show how useless ASEAN is.


Govt. of Ukraine


I’d also question the “curation” that has occurred with the Govt. of Ukraine’s statements on RW. I’m not really sure that this is what the Ukrainian government would like to communicate to the humanitarian community. Unless educational institutions are a cornerstone of the response, which funding indicates that they are not. The focus on sexual violence is less puzzling as the Ukrainian government might be signalling for humanitarian actors to move more fully into that space to complement state services or it could be calling the humanitarian community’s attention to particularly egregious war crimes.



USAID


Perhaps one of the most interesting things to check would be to see how well a donor’s rhetoric aligns with their actions and their funding disbursements. Looking at the term frequencies, one could charitably interpret them USAID’s main concerns as (in order), everything below gorf attacks does paint the picture of a typical donor. Does USAID mentioning health facilities more than food assistance mean that WHO will be getting more money than WFP? We’ll check FTS.



ACLED


Finally, let’s take a look at ACLED. It should be quite apparent why I would want to analyse ACLED conflict event descriptions on their own – what has been uploaded to RW are just ACLED’s weekly reports, which cover Ukraine, but other things as well, such as presidential elections, the overturn of roe as well as mentions of germany police. This corpus is polluted. Furthermore, it provides irrelevant data to persons who search up ACLED on RW – muddying the waters on one of main sources of incident tracking is quite unforgivable.

Download a bigram network graph based on ACLED conflict event descriptions, instead of on weekly collated reports. Immediately, one can see that the level of detail and utility is markedly different and the ACLED dataset contains much richer textual data than what has been uploaded to RW.




Bigram search helper

Below, bigrams are sourted by count in a searchable table.




Pairwise correlations between words

Search for any word pair to see which other words they are most highly correlated to: